
Conversation

@EnricoDeg
Contributor

@EnricoDeg EnricoDeg commented Dec 19, 2025

Proposed changes

Summary:

  • Add support for direct store in epilogue instead of cshuffle (performance improvement for small K problems)
  • Add padding support for wave transfer without transpose
  • Add wave transfer with interleaved layout to support direct store
  • Enable new functionalities on GEMMs
  • Add optional new functionality support for grouped convolution fwd
  • Add some fast instances for grouped convolution fwd with new functionalities (proper tuning needed)

Checklist

Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.

  • I have added tests relevant to the introduced functionality, and the unit tests are passing locally
  • I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run.
  • I have added inline documentation which enables the maintainers to understand the motivation
  • I have removed the stale documentation which is no longer relevant after this pull request
  • (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
  • I have run clang-format on all changed files
  • Any dependent changes have been merged

Discussion

If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered

kabrahamAMD
kabrahamAMD previously approved these changes Dec 22, 2025
Contributor

@kabrahamAMD kabrahamAMD left a comment


LGTM, good work

index_t,
index_t)
{
// Notes: padding is currently not supported
Contributor

Update comment (or remove, as it provides no additional information to the error message)

Contributor Author

Fixed

@EnricoDeg EnricoDeg force-pushed the streamhpc/remove_cshuffle branch from 461a40d to 9134cbb on December 23, 2025 at 07:05

// Limitations of the current implementation:
// - no multiAB
// - GemmSpecialization Default
Contributor

It seems like GemmSpec Default is no longer a hard requirement for using WaveTransfer, but is only required when the A and B layouts are not row- and column-major respectively? If so, update the comment?

Contributor Author

Done


// We need to investigate if it makes sense to remove cshuffle for smaller types
// Currently we use direct store for NRepeat equal to 4 or 8. For 16 bit type we use at
// lease buffer store 64 bit for 16 contiguous threads -> 128 bytes in toral (full cache line)
Contributor

typos: lease, toral

Contributor Author

Fixed

@krithalith
Contributor

LGTM, had some small notes and I have a few questions:

  • Has conv fwd performance with wavetransfer been benchmarked?
  • Has wavetransfer / directstore been benchmarked in general?
  • If no to the above two questions, should we make a new issue for this?
  • Why do we only have wavetransfer instances for grouped conv fwd vanilla 2D?
  • Where did the vanilla fwd 2d instances come from?

@krithalith
Contributor

Discussed offline. Basically benchmarks have been performed and wavetransfer did seem to be a decent bit better for 2D. No significant improvement for 3D so no instances added. We want to look into more automated benchmarks later.

krithalith
krithalith previously approved these changes Dec 24, 2025

Copilot AI left a comment


Pull request overview

This pull request adds support for direct store in epilogue and padding support for wave transfer without transpose, enabling performance improvements for small K problems and grouped convolution operations on GFX12 hardware.

Changes:

  • Added direct store epilogue option that bypasses cshuffle for better performance on small K problems
  • Implemented padding support for wave transfer without transpose operations
  • Added wave transfer with interleaved layout to support direct store functionality
  • Enabled new wave transfer instances for grouped convolution forward operations (F16 and BF16)

Reviewed changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated no comments.

Summary per file (File — Description):
device_grouped_conv2d_fwd_wmma_cshufflev3_wave_transfer_nhwgc_gkyxc_nhwgk_f16_instance.cpp New F16 instance definitions for wave transfer grouped convolution
device_grouped_conv2d_fwd_wmma_cshufflev3_wave_transfer_nhwgc_gkyxc_nhwgk_bf16_instance.cpp New BF16 instance definitions for wave transfer grouped convolution
device_grouped_conv_fwd_wmma_cshufflev3_wave_transfer_instance.hpp Template instantiations for wave transfer instances with various tile sizes
gridwise_ab_transfer_wave_tiles_interleave.hpp New interleaved wave tile transfer implementation for direct store support
epilogue_direct_store.hpp New direct store epilogue that bypasses cshuffle for improved performance
gridwise_gemm_wmma_cshuffle_v3_common.hpp Core logic for conditional epilogue type selection and wave transfer applicability checks
gridwise_gemm_wmma_cshuffle_v3.hpp Added IsFusedKernel template parameter to control direct store usage
gridwise_ab_transfer_wave_tiles.hpp Added padding support via PadGridDescriptor helper method
epilogue_cshuffle_v3_wmma_base.hpp Added IsLDSNeeded() method for LDS usage detection
device_grouped_conv_fwd_multiple_abd_wmma_cshuffle_v3.hpp Major refactoring to pass M_K and N_K descriptors, enable runtime descriptor transformation, and force MN padding
device_grouped_gemm_wmma_splitk_cshuffle_v3.hpp Updated to use conditional epilogue type selection
device_batched_gemm_wmma_cshuffle_v3.hpp Updated to use conditional epilogue type selection
device_batched_gemm_wmma_cshuffle_v3_b_scale.hpp Updated to use conditional epilogue type selection
device_gemm_reduce_wmma_cshuffle_v3.hpp Added IsFusedKernel=true to disable direct store for fused kernels
device_gemm_multiple_d_layernorm_wmma_cshuffle_v3.hpp Added IsFusedKernel=true to disable direct store for fused kernels
device_gemm_bias_add_reduce_wmma_cshuffle_v3.hpp Added IsFusedKernel=true to disable direct store for fused kernels
CMakeLists.txt Added new source files to build configuration
grouped_convolution_forward_wmma_cshufflev3.inc Added function declarations for new wave transfer instances
grouped_convolution_forward.hpp Added instance registration calls for new wave transfer operations


bartekxk
bartekxk previously approved these changes Jan 12, 2026